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We show how to accelerate relativistic hydrodynamics simulations using graphic cards (graphic 
processing units, GPUs). These improvements are of highest relevance e.g. to the field of high- 
energetic nucleus-nucleus collisions at RHIC and LHC where (ideal and dissipative) relativistic hy- 
drodynamics is used to calculate the evolution of hot and dense QCD matter. The results reported 
here are based on the Sharp And Smooth Transport Algorithm (SHASTA), which is employed in 
many hydrodynamical models and hybrid simulation packages, e.g. the Ultrarelativistic Quantum 
Molecular Dynamics model (UrQMD). We have redesigned the SHASTA using the OpenCL comput- 
ing framework to work on accelerators like graphic processing units (GPUs) as well as on multi-core 
processors. With the redesign of the algorithm the hydrodynamic calculations have been accelerated 
by a factor 160 allowing for event-by-event calculations and better statistics in hybrid calculations. 



* |jochen.gerhard@compeng.um-frankfurt.de| Tel.: +4969 798 44112; Fax: +4969 798 44109 



2 



I. INTRODUCTION 

Relativistic hydrodynamics l I] has a long history since its first application in nuclear collisions [2] and for the evolution 
of the universe [3]. It is still an excellent tool to describe systems of different scales, ranging from collisions of galaxies 
down to collisions of heavy ions or even protons at LHC[J]. All these simulations need to respect the relativistic 
nature of the dynamics, however the focus lies on different aspects. While astronomical calculations have to include 
viscosity and turbulences, heavy ion simulations can neglect turbulences. Here, algorithms that have the ability to 
capture shock phenomena or relatively sharp gradients (and potentially very small viscosity) have to be applied. For 
the present paper we focus on nuclear collisions, however, the methods can also be transfered to cosmological and 
supernova simulations. 

A large body of current numerical tools to explore and interprete the experimental data obtained in high energy 
experiments employs relativistic hydrodynamics in various facets: On the purely hydrodynamic side there are the 
models of Heinz and Kolb[5], the model of Romatschke[6], Csernai's[7j hydrodynamic model, and various multi fluid 
approaches, e.g. by Toneev[51 and Brachmann and Dumitru|10j. In the recent years also hybrid approaches 
between Monte-Carlo / Boltzmann event simulators and hydrodynamics have gained substantial popularity. Most 
noteworthy in this respect are the approaches of Hirano and Nara (hydrodynamics+JAM)[ll , Blcicher and Petersen 
(UrQMD 3.3)[12] using the SHASTA, NeXSPheRIO [L3] based on smoothed particle hydrodynamics, EPoS[H], Bass and 
Nonaka[T5] realizing a Lagrange approach coupled to UrQMD, the hydro-kinetic model (HKM)JTB], and MUSIC [17]. 
For the numerical solution different methods are applied ranging from smooth particle methods, over the Kurganov- 
Tadmor[TH] algorithm, to the SHASTA[T!jj. The present examples are based on the SHASTA as it is employed in 
UrQMD, which offers since version UrQMD 3.3 a hybrid transport model [12]. based on the relativistic implementation 
by Rischkc[20 . We shall refer to this implementation by the name of classical FORTRAN implementation, as it is 
coded completely in FORTRAN 77 compared to our C++ / OpenCL implementation. 

Although the reduction of complexity by neglecting turbulences and viscosity reduces the computational demands 
of the simulation, the running time, typically between 20 minutes and one hour per event for a (3 + 1) dimensional 
simulation, still prohibits the calculation of high statistics samples. As the SHASTA is well tested and suitable for 
heavy ion collisions, we have redesigned this algorithm to run on modern accelerator hardware in order to improve the 
performance. The presented OpenCL implementation allows us to harvest the computational power of accelerators, 
CPUs, and multi-core processors. After all improvements described in this paper, we gain a speed-up of up to a factor 
160 for the full propagation on a GPU, compared to the standard FORTRAN implementation used in UrQMD 3.3 on a 
CPU. For previous implementations of classical hydrodynamics algorithms on GPUs we refer the reader to [2"Trl24| . 



II. FORMULATION 

The hydro phase of heavy ion collisions is governed in principal by the Euler equations: 

(1) 
(2) 
(3) 

With p being the density, v the velocity vector, p the pressure, and e the energy density. The equations are closed by 
adding an equation of state. For the heavy ion collision case different equations of state have been proposed and can 
be used for the simulation. Different models use non analytic equations of state, and provide parametrized equations 
for the thermodynamical quantities. Various numerical interpolations schemes can be applied here and leave the 
possibility for additional optimizations on GPUs. For the present numerical study we restrict ourselves to the ideal 
gas equation of state. 

As high energy heavy ion collisions proceed near to the speed of light, relativistic effects have to be taken into 
account, which reformulates the equations to: 

« = (5) 
For the propagation of the energy-momentum tensor T^ v and the baryon current Jg. The energy-momentum-tensor 
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can be written as: 

7*1 ={e + p) 1 2 v i v j +p5 ij (6) 

T m ={s+p) 1 2 v l (7) 

T 00 =(e+pw 2 ) 7 2 (8) 

And taking Jg = (7/9, 7/3v) = (A/ - , A/v) the energy-momentum-tensor and the Baryon density can be written in terms 
like mass, momentum and energy: 

N = P1 (9) 
M=(e + p)7 2 v (10) 
£=(e+^ 2 ) 7 2 (11) 

Hence, following [251 ESL the relativistic Euler equations take the form similar to the non relativistic differential 
equations Q, Q, and d3]): 

J^Af + V-(AAv)=0 (12) 
^M + V- (Mvlpl) =0 (13) 
J^ + V.(v(£+p)) = (14) 

III. NUMERICAL METHOD 

The relativistic SHASTA [T2j 120] is composed of four phases: 

1. Geometric transport. 

2. Anti-Diffusion with flux limiter. 

3. Relativistic correction. 

4. Relativistic calculation of the equation of state. 

The geometric transport resembles a finite volume scheme by calculating fluxes between cell boundaries. For an 



arbitrary quantity U (standing for E, M., and Af of the equations ( |T2| ), ( 13 1), and ( 14 1) the fluxes to neighboring cells 
are calculated by geometrical factors. These factors are determined by the pressure and velocity of Lagrangian fluid 
parcells[in]. After the propagation step, the parcells are interpolated back an equidistant lattice with space index j 
and time index n. The geometrical factors including the interpolation back to the grid are given by [20 \ 

(15) 



2 T Vj -A 



i±(w"±! 5 -a-«; +5 ■ a) 

With A = ^ being the Courant-Friedrichs-Lewy number, and v the propagation velocity, calculated on a staggered 
grid in time direction. 

The propagation of energy and momenta consists not only of the material derivative, but an additional source 
term / originating from the acceleration of fluid particles by the pressure gradient. Thus, for the propagation of 



momenta (equation (13l) the additional divergence of the pressure tensor pi and for the propagation of energy 
(equation ( 14 )) the divergence of velocity and pressure vp have to be calculated in order to compute a time step. The 
source term's differential A(/) is also computed on a staggered grid in time direction. We use the central differential 

! , f n+: 

2 yjj+i 



A(/) = — \{f^+i — f?-i ) as proposed in [20]. The purely geometrically propagated quantity Uf takes the following 



form: 



A, = " Uj, (16) 
U? = \Q 2 + ^ - \Ql^-i + (Q+ + Q-)U? + AA(/). (17) 
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As the geometric transport produces inherently a certain amount of numeric diffusion the next step is to correct the 
results by an anti-diffusion terrrj^] A ph . The constraint to the anti-diffusion is not to create new maxima or minima 
on the grid. Thus, the anti-diffusion itself is limited by a flux limiter depending on a neighborhood of points. With 
the preceding definitions, a time step is completely determined by: 

4- = U J+ i - Uj (18) 

Af = \ (A, - i (A i+1 - 2A, + Aj_i)^ (19) 

a = sgaAf (20) 

A 3 =cr-max|o,min|CTA J+ i,|v4P h |,CTA :) _i|| (21) 

U? +1 = U?-A j +A 4 -i (22) 

In the relativistic case, all quantities are boosted from computational frame to their eigenffame in order to calculate 
the thermodynamic pressure, the baryo-chemical potential, and the propagation velocity. To obtain these quantities 
from the energy density e and the baryon density p we employ the ideal gas equation of state. Although the ideal 
gas equation enables an analytical solution to the calculation of the propagation velocity, we have used a fixed point 
root finding algorithm to allow for an easy implementation of a tabled equation of state without major changes in the 
codebase. 



A. Algorithm Design 



The usage of GPUs had been of great interest to the field of high energy physics even before state-of-the-art 
programming frameworks, like CUDA or OpenCL, have been developed [37]. In order to harvest best performance of 
GPUs one has to bear in mind certain architectural constraints of these devices. This implies often a very different 
approach than classical CPU programming. The implementation in this paper is designed in OpenCL and fitted to an 
AMD 5870 GPU. Therefore certain optimizations are also different [25] to typical optimizations carried out on NVIDIA 
GPUs. Although designed for this special kind of GPU the implementation still shows significant speed-up on CPU-only 
systems. Here additional optimizations, respecting caching and memory layout, e.g. interleaving the 3 dimensional 
grid variables, would mitigate the losses due cache misses when propagating in y or z direction, allow for further 
significant accelerations. 

The OpenCL-SHASTA consists of a C++-part, managing the memory allocation and enqueueing of the kernels. The 
kernels are routines written in OpenCL and completely run on the GPU or multi-core CPU. Kernels are executed in 
a parallel manner, and each singular instance of a kernel is called a work-item. These work-items are mapped, in 
hardware, to the stream cores of GPUs and CPUs. The mapping occurs in small groups, whose size depends on the 
hardware used. The smallest possible group is called wavefront. Within each wavefront the execution flow must be 
uniform, i.e. when branching occurs within a wavefront, all branches are computed serially. 

The approach to parallelize on GPUs must bear in mind, that the running program at the end consists not of a few 
parallel threads, but merely thousands of concurrent execution streams. Yet most of this streams are computing almost 
the same, which ranks this approach as a Single Instruction Multiple Data (SIMD) approach. In order to organize 
these thousands of execution streams the first, and arguably most important steps, are the problem decomposition 
and domain decomposition. Both points are somewhat entangled, still first order they can be handled separately. 



1. Problem Decomposition 



Firstly the dataflow of the algorithm has to be analyzed. Not only the physical quantities, that are finally stored, 
but also the steps in between and intermediate results of the computation are important. In figure [T] the dependencies 
within each timestep are illustrated. Before each quantity can be calculated, all the quantities pointed to, have to 
be computed. The dataflow and the spacial and causal dependencies of variables within build the constraints to 
any possible parallel computation. The SHASTA performs the propagation of the energy-momentum tensor and the 
net baryon current. Therefore five physical quantities {£, AA, and Af) are propagated (equations ([9]), (10), and 



Different anti-diffusion terms are possible. As stated in |20| the most suitable for this case is the so called phoenical anti-diffusion. 
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All physical quantities 




geometric propagated quantity 




FIG. 1. : Variable dependencies in data flow, time and space indices are omitted. (Each arrow reads as "depends on" 



(111). The full propagation of each quantity is done by the sequent steps from equation (15) to equation (22 1 



Subsequently relativistic corrections have to be applied, as well as the calculation of the thermodynamic quantities in 
their eigenframe. 

Most of the propagated quantities can be calculated, completely in parallel. As illustrated in figure [2j the propa- 
gation of each quantity, including the anti-diffusion and flux corrector, is calculated independently. Though, as the 
relativistic corrector needs the propagated state of all quantities (see also figure [T]) , the execution of the relativistic 
corrector and the calculation of the equation of state is scheduled after all five kernels, propagating the quantities 
independently, have finished computation. This is realized by the orchestrating method of the GPU-class (listing [2]). 

Another possible decomposition we have investigated on, is using the kernels firstly to calculate all the geometric 
propagated quantities (equation (16l) in parallel and subsequently applying new kernels for the anti-diffusion step. 
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FIG. 2. : Problem decomposition and data flow in OpenCL-SHASTA. 



Thereby one kernel can calculate more than one geometric propagated quantity (e.g. all the momenta together) which 
favors the usage of the vector units of GPUs and modern CPUq However this approach makes extensive usage of the 
GPU memory, by storing intermediate results to the global memory and reloading it in the anti-diffusion kernels later. 
In the execution model of OpenCL a global synchronization between work-units is not possible. Thus, different kernels 
have to be enqueued to serialize the task. 



2. Domain Decomposition 



For every grid based algorithm the domain decomposition is often suggested by the grid structure. Though the 
exact mapping of kernel number to grid size depends on hardware and algorithm type. The current implementation of 
OpenCL-SHASTA uses a kernel per physical quantity for each of the eight million grid cells cells. This choice works very 
well on GPUs. Optimizations aimed mainly on multi-core CPUs would do more efficiently using kernels responsible for 
more than one cell or more than one physical quantity per cell. Depending on the algorithm type, see section ["ill A 1| 
the dependencies within each grid cell might lead to even more kernels per grid cell on CPUs. The (3 + l)-dimensional 
problem of relativistic hydrodynamics is perfectly suitable to the three dimensional NDRange arguments in OpenCL. 
In order to implement the (3 + l)-dimensional propagation of the energy-momentum-tensor, we use a dimension split 
approach. Therefore each work-item needs for its calculation only a one dimensional neighborhood. The size of the 
1-D differential stencil still depends on the applied scheme, nevertheless this is a significant reduction of the buffer 
siz^] needed. In this implementation each work-item computes independently the geometric flux of all neighboring 
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FIG. 3. The neighborhood used to compute the propagated quantity in cell j (black) spans seven different cells. To compute 
the flux corrected value for cell j the geometric transport for cells (j — 2) . . . (j + 2) (grey) is needed. 



2 Almost every modern processor offers a richt set of vector units (SSE). 

3 number of registers 
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cells in order to calculate its flux corrector and save the flux corrected value. In figure [3] the needed neighborhood 
of each cell is illustrated: to compute the flux corrected value for cell j (black) each work-item has to calculate the 
geometric propagated quantity in cells (j — 2) . . . (j + 2) (grey). Therefore additionally the cells j — 3 and j + 3 have 
to be used. This is done in the for-loop serially and buffered in the flux [] -variable. However, the loop increases the 
computational workload of each kernel significantly, as each geometric flux has to be computed five times more often 
than in the serial implementation. An additional overhead is caused by the calculation of both limited anti-fluxes, 
i.e. Aj and in equation (22) in the variables ea and eb in listing TJ hence this step doubles the workload 

compared to the serial implementation. Although the alternative problem decomposition in section [ill A 1| avoids the 
multiple calculation of geometric fluxes and anti-fluxes, its overall performance proofed to be less due to the additional 
memory usage. This is because the additional computations allows each work-item to compute the propagated quantity 
independently from all other work-items, and thus no serialization e.g. between the geometric propagation and the 
anti-diffusion step is needed. 

For each quantity a specialized kernel (similar to listing]!]) is enqueued. To use the full capacity of a GPU, the kernel 
size must be chosen carefully. If the kernel's active working set is too big, only few work-items can run contemporary, 
as they must share the available buffer memory. (Not only the neighborhood (figure [3| of the propagated quantity is 
necessary, but additionally each work-item needs to access a similar neighborhood of velocity, pressure, and different 
control variables. On the other hand, if too many kernels are needed to compute the propagation, the increased 
number of enqueued kernels propose an additional orchestrating overhead. The use of mid weight kernels, computing 
all needed fluxes of a neighborhood but propagating only one quantity per kernel, allows for an efficient usage of the 
underlying hardware (all stream cores in parallel) without overly increasing the orchestration overhead. 



3. Branching Free Execution Flow 



We have designed the components of OpenCL-SHASTA in a modular way and implemented different kernels and 
auxiliary functions. This allows shorter development cycles and a ensures good validation of the algorithm. Addition- 
ally the kernels and auxiliary functions can be substituted even at run time leaving the choice of different anti-flux 
functions, different source terms, and even different equations of state. In (listing [l]) the constant dif f allows a fine 
controlled application of numerical viscosity. The value of the constant as well as the selection of more sophisticated 
or faster anti-diffusion routines are controlled by meta-programming[29 . As the kernels are compiled and loaded onto 
the GPU at run-time this choices do not infer additional branchings, which would slow the execution down. Never- 
theless a free selection of the desired numerical hydrodynamics realization is possible. Therefore a wide variety of 
calculations can be carried out by the suggested implementation without the need of complicated branching patterns 
within the computational relevant parts of the program. 

Let us stress the importance of avoiding (unnecessary) branching in GPU programs: current GPUs do not offer a 
branch predictor, therefore branching comes generally with a significant penalty on GPUs. (As stated in |28] up to 
a loss of 30% of the execution speed.) To avoid branchings we have designe specialized kernels for all quantities in 
OpenCL-SHASTA. The kernels vary according to their propagated quantity, source terms and propagation direction. 
For example the kernels responsible for the momenta parallel to the propagation direction have an additional source 
term, whereas momenta perpendicular to the propagation direction are transported source free. We use a number 
of halo cells to avoid complicated indexing for the finite difference schemes at the border of the grid. The kernel 
operates only within the boundaries of the inner grid, limited by grid size (GS) and halo size (HS). For stability reasons 
the code has been implemented based on a half-step method. The different half steps are implemented without a 
branching control structure, instead additional kernels change underlying control variables like A. We use a special 
double buffering scheme to spare expensive memory copies. Instead of copying onto different grids, we have designed 
to each kernel call an adjunct kernel call, which reverses the used buffers. The adjunct kernel calls are orchestrated 
in the launcher method (listing [2]) of the GPU-class. 

All different governing equations have been implemented in different kernels in order to stay clear of any branches 
within the execution flow. The orchestration of this set of 30 different kernel calls, is done by the execution method 
on the host (listing [2]) . Accordingly, the GPU can start the computation, while the correct kernels are still enqueued 
into the command que. No branching within the GPU is necessary. 



4 Complicated indexing methods not only imply branchings but often a very inefficient memory access. 
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FIG. 4. Total execution time for the expanding ball of r = 2 fm with constant energy density. The CPU and GPU are measured 
with the exact same OpenCL code and compared to the Fortran (FORT) implementation [T2l 120] . 



4- Memory Aware 3-D Calculation 

Due to the dimension split scheme each operation is executed only in one dimension at a time. This reduces the 
amount of registers needed for each kernel drastically, as only 1-D differential stencils (figure [3]) have to be calculated. 
This reduces the needed memory size to a third compared to a full 3-D stencil implementation. The propagation 
direction follows a fixed permutation of the three axesQ After the initial copy of all needed quantities to its private 
memory, the kernel (listing [TJ does not need any access to global memory for its calculations. On the contrary to 
the classical FORTRAN implementation no differentials have to be stored. Intermediate results, like the geometric 
flux, anti-diffusion, and flux-limiter can be stored in registers. The functions yielding the necessary geometric factors 
(qptO and qmtO) and the antiflux (antifluxO) are inlined functions. They are calculated exactly when needed 
(figure [T| and need not to be calculated in advance. (The former mentioned alternative problem decomposition 



(section IIIAl) needs additional global memory for the geometric fluxes, anti-fluxes, and flux limiters.) 

According to the permutation scheme only the propagation speed v\\ parallel to the actual propagation direction 
is calculated beforehand, which limits the global memory usage to only one field for the propagation speed. This 
approach underlines again the recompute instead of memory lookup paradigm which holds for various applications on 
many-core architectures 30 . The global memory footprint is determined by the seven quantities residing on a 200 3 
cell grid. As only single precision is needed to represent these quantities, the total memory consumption has been 
reduced: including the double buffering memory scheme, the total consumption is less than 500 MB. This allows to 
run the code even on commodity CPUs. By limiting the working set to a minimum and concentrating on recompute 
instead of expensive memory lookups, the remaining memory can be used to increase the grid size, enhance precision 
or to hold more complicated (tabled) equations of state, or even to enhance SHASTA with the necessary tables to 
calculate viscous hydro dynamics [311 152] , 



IV. RESULTS 

TM 

We have tested different examples on the LOEWE-CSC cluster using AMD 5870 CPUs and AMD Opteron 6172 
processors (24 core). Test cases included classic test problems, spheres of different metrics, and realistic initial 
conditions generated by UrQMD. For realistic cases the physical simulation time is on the order of 10 - 20 fm/c which 
transforms to 200 time steps (equalling to 16 fm/c in the current setup). 

The measured accelerations depend slightly on the geometry of the input. The best acceleration is achieved for the 
realistic UrQMD input files. Here the overall computing time for a physical running time of t = 8 fm/c for an Au+Au 



5 We found this approach more stable than a Monte Carlo approach. 
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FIG. 5. Total execution time for a Au+Au collision with -^/snn = 200 GeV. The CPU and GPU are measured with the exact 
same OpenCL code and compared to the FORTRAN (FORT) implementation [T2l |20] . 
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FIG. 6. The acceleration of the execution speed for different initial geometries. Here again the exact same code is executed 
on CPU and GPU. 



collision is reduced to less than 30 seconds. Compared to the standard FORTRAN implementation [121 I2TJ] which needs 
1 hour and 15 minutes, we find an acceleration of more than a factor of 160. 

In figure [4] and figure [5] the total execution time for two different initial geometries is depicted. Figure [4] shows the 
comparison for a spherical setup (ball, \\ ■ H2 ) of radius 2 fm. Figure [5] shows the comparison between the FORTRAN 
and OpenCL implementation on CPU/GPU for the UrQMD initial configuration for a Au+Au collision at v / §nn = 200 
GeV with impact parameter b = 7 fm. Even without a GPU at hand the OpenCL implementation provides a significant 

TM 

speed-up on CPUs. On the AMD Opteron 6172 processor (24 core), we measured an acceleration by a factor of 
ten. Although this is not very efficient, let us stress that a simple parallel execution of the standard FORTRAN 
implementation is not possible, as the memory consumption of the standard implementation is more than five times 
higher than the OpenCL implementation. Over the above the exact same implementation on GPU and CPU are compared 
here. Though the program design allows to choose appropriate kernels and environment variables for their execution, 
depending on the provided architecture (see section [ill A 1 and III A 2 ), enabling further speed-ups on CPUs. 

Figure [6] summarizes the increasing speed-up of the execution of the OpenCL implementation on the GPU. The 
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FIG. 7. Average time consumption for one time step in the expansion of a spherical symmetric system (ball) and for a Au+Au 
collision on the GPU. The n/af routines show calculations without anti-flux. 



difference between the geometries is reflected in the better distribution of computations to work-items. In the spherical 
symmetric case (denoted as ball) the initial geometry is concentrated in the center of the grid, therefore at the beginning 
all the workload is concentrated on few work-items. During the execution more cells are filled with non-zero values. 
As each quantity in each cell is computed separately in a work- item, a sparse grid is not very efficient, while a full 
grid makes best use of all the stream cores of a GPU. 

In figure [7] we observe the initial cost of memory transfer to the GPU, which becomes insignificant after a small 
number of time steps. Nevertheless time consumption increases again later, depending on the problem's geometry. 
One observes the impact of the complex anti-flux function on the average execution time. Without the complex anti- 
flux no increase can be observed and the acceleration due filling the grid takes fully place. In SHASTA the anti-flux is 
corrected by a flux limiter. This flux limiter is calculated by a search of maxima and minima of the surrounding cells 
and fluxes towards this cells. In this calculation branching is inherent and takes also place within different wavefronts. 
Therefore all branches within these wavefronts are calculated by the device and the correct results are gained by 
masking the wrong branches out. Hence the execution time is increased significantly, when the flux limiter is not 
uniform. 

Finally figure [8] shows the direct comparison between the present single precision implementation (full line) and 
the standard FORTRAN implementation (dotted line). For the realistic initial setup of a ^/snn = 200 GeV Au+Au 
collision provided by UrQMD we find only minor differences between both implementations. However, we work on a 
mixed precision implementation of OpenCL-SHASTA. In this implementation we use double precision for the less stable 
parts of the numerics, like the calculation of the boost. This becomes relevant at higher energies. 



V. SUMMARY AND OUTLOOK 



On current hardware, like the AMD 5870 GPU, double precision calculations come with a slowdown of a factor of 
five[^], additionally a full double precision approach doubles the memory consumption. We conclude: a single precision 
implementation allows for fast calculations, as well as the enhancements of the underlying physical model, e.g. by 
adding calculations of viscosity [3TJ|32]. If numerical instabilities occur, e.g. in the calculation of the Lorentz-boost 
7 we suggest the usage of mixed precision implementations. 

We have designed the OpenCL-SHASTA to work on commodity GPUs, nowadays present in almost every computeiQ 
As the OpenCL implementation allows for usage on GPUs, accelerators, as well as on classical multi-core processors, the 
OpenCL-SHASTA can be included in bigger frameworks, that need to be executed on a variety of different architectures. 



A special setup is needed to use double precision as well as the increased memory consumption. These setups are highly hardware 

dependent and future hardware may be more suited to double precision calculations. 

Even, where no such device is present, the implementation benefits of all present cores and vector units. 
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FIG. 8. Comparison between FORTRAN and OpenCL propagation for realistic initial conditions of a Au+Au collision with 
■\J snn = 200 GeV and impact parameter b = 7 fm at t — 8 fm/c provided by UrQMD. (The asymmetry present in both 
implementations is due to fluctuations in the initial state.) 



Due to the tremendous speed-up it resolves the problem of a computationally demanding hydrodynamic phase in 
hybrid models, like UrQMD, and allows for better statistics, stability analyses[33], and unprecedented event-by-event 
simulations. 
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Appendix A: Listings 



Listing 1. A generic kernel propagating source free quantities in X-direction. 

_kerncl void generic_X( __global float* In, __global float* Out, 

__global float* v, __global float* outcfl) 
uint idx = gc t _glob al _id ( ) ; 
uint idy = get _global_id ( 1 ) ; 
uint idz = get _global_id ( 2 ) ; 
uint myid = idx + idy * DY + idz * DZ ; 

if ( (idx >= HS) kk '(idy >= HS) kk (idz >= HS) kk (idx < GS - HS) 
&fc (idy < GS - HS) kk (idz < GS - HS) ){ 
const float cfl = *outcfl ; 
const float diff = l.Of; 

..private float basis [7] = {In [myid — 3] , In [myid— 2], In [myid — 1], 

In [myid], In [myid + 1] , In[myid + 2], In[myid + 3]}; 

__private float flux [5]; 

..private float velocity[7]= {v[myid — 3] ,v[myid — 2], v[myid — 1], 

v[myid], v [myid + 1], v[myid + 2], v[myid + 3]}; 

for (short i = 0; i < 5; ++i){ 

const float qpt = qp ( velocity , i+1, cfl ); 
const float qmt = qm( velocity , i+1, cfl ); 
flux [ i ] = 0.5f * qpt*qpt * ( basis [ i+2] — basis [ i +1]) 

— 0.5f * qmt*qmt * ( basis [ i+1] — basis [ i ] ) + (qpt+qmt)* basis [i+1]; 

} 

const float ca = antiflux ( flux , basis, 0, diff); 
const float cb = antiflux ( flux , basis, —1, diff); 
Out[myid] = flux [2] — ca + eb ; 
} else Out [myid] = In [myid]; 



Listing 2. A part of the enqueueing routine. 

const int branch[] = {0,1,2, 0,2,1, 2,0,1, 2,1,0, 1,2,0, 1,0,2}; 
for (int round = 0; round < steps; ++round) { 
i = branch [round % 18]; 
if ( i = ) { 

queue . enqueueNDRangeKernel (untangT , cl : : NullRange , 

c 1 : : ND Range ( GS , GS , GS ) , c 1 :: NullRange , 
NULL, NULL); 

queue . enqueueNDRangeKernel ( half _cfl , cl :: NullRange , 

cl : :NDRange(l ,0 ,0) , cl :: NullRange , 
NULL, NULL); 

queue . enqueueNDRangeKernel ( propX_e , cl : : NullRange , 

c 1 : : ND Range ( GS , GS , GS ) , c 1 :: NullRange , 
NULL, NULL); 

queue . enqueueNDRangeKernel ( prop_parallel_mx , cl : : NullRange , 

c 1 : : ND Range ( GS , GS , GS ) , c 1 :: NullRange , 
NULL, NULL); 

queue . enqueueNDRangeKernel (propX.my , cl : : NullRange , 

c 1 : : ND Range ( GS , GS , GS ) , c 1 :: NullRange , 
NULL, NULL); 
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Appendix B: Parallelization Guidelines 

Of course no simple and general mechanism of parallelizing algorithms can be stated. Nevertheless, most of the 
steps we investigated on for the OpenCL-SHASTA can be applied to similar algorithms in this field. Therefore we 
state here a best practice approach, we believe can be followed by other research groups, who wish to port their code 
on GPUs. 



1. Problem Decomposition 

• Physical quantities, that by superposition are not dependent on each other, are the first choice to be computed 
in parallel with different kernels. (These are e.g. £ , A4, and Af.) 

• If the independence between variables can be gained by a finer time step resolution, the additional execution 
time of more time steps is easily compensated by the parallelization. 

• Not only physical quantities can be parallelized. Also intermediate results within a numerical scheme can often 
be calculated in parallel. 

• A data dependence graph, like figure [l] helps to group calculations to kernels. (Often the connectivity between 
cliques is a first hint to beneficial cuts) 



2. Domain Decomposition 

• In grid based algorithms a decomposition to work-items for each grid point is a natural choice. (We group 
206 x 206 x 206 work-units per kernel to the queue, for the 200 3 -cell grid, adding halo cells.) 

• The aim of the decomposition is to minimize the communication between work-items: numerical schemes, 
that need a high communication between grid cells might be more efficient with work-items handling small 
neighborhoods of the grid. (In OpenCL-SHASTA's kernels the initial step is to store all necessary quantities in 
local memory, thus no further communication to other cells is necessary.) 

• In OpenCL exists the possibility to group kernels to workgroups, simplifying and accelerating communication 
within work-items of workgroup. 

• Here also a data dependence graph for the spacial dependencies can show how a useful grouping can be done. 

• In algorithms not based on a grid, but e.g. test-particles a possible choice is to map the number of particles to 
work-items. 

• The ordering of the particles should mimic the spacial ordering (in phase-space) of the particles, to benefit from 
potential cache and coalescent memory lookup effects. 



3. Kernel Design Criteria 

• Depending on the number of work-items that shall be executed in parallel, the active working set of a kernel 
may not be too big, as the ressources are divided by all contemporary executed kernels. (We chose to compute 
only one physical quantity per kernel, as many fluxes have to be computed. Algorithms with a smaller update 
step maybe more efficient e.g. by computing the momenta together.) 

• Kernels should be as independent as possible, synchronization between virtually thousands of work-items is 
cumbersome and error prone. (We have chosen to calculate all fluxes needed within each kernel, thus no 
communication was necessary.) 

• A favorable approach is to start with smallest possible kernels and merging kernels whilst measuring the execution 
time. (For example in the OpenCL-SHASTA we started with extra kernels calculating the geometric flux and kernels 
calculating the anti-flux. The merge of both kernels proofed to be more efficient.) 
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4. Hardware Specific Optimizations 

Branching within kernels is most detrimental on GPUs, codes designed for CPUs are less affected by branchings. 

Often the needed branchings in physical models follow a fixed scheme completely determined before execution, 
e.g. the model's choice of the EoS. Therefore it should be encoded with different kernels and orchestrated in 
advance. (The kernels for velocity and pressure can easily be substituted before they are loaded to the GPU.) 

Memory access is always a critical point: it should be avoided wherever possible. (Many simulation codes save 
differentials in arrays, they can be calculated directly) 

Floating point calculations are highly efficient on GPUs. Even transcendental calculations should not be stored 
in arrays with intermediate results. 

Instead the use of inlined functions (like e.g. qp and qm in OpenCL-SHASTA) allows for a clean code and an 
efficient execution. 

Caching might be an issue on special hardware, spacial locality ensures often an efficient usage of present cache 
structures. 

Coalesced memory access is also often gained by a correctly ordered enqueueing. (Spacial neighbors should have 
neighboring work-item-ids.) 



